NON-CODING RNA GENES AND
THE MODERN RNA WORLD



Sean R. Eddy 

Non-coding RNA (ncRNA) genes produce functional RNA molecules rather than encoding 
proteins. However, almost all means of gene identification assume that genes encode proteins, 
so even in the era of complete genome sequences, ncRNA genes have been effectively 
invisible. Recently, several different systematic screens have identified a surprisingly large 
number of new ncRNA genes. Non-coding RNAs seem to be particularly abundant in roles 
that require highly specific nucleic acid recognition without complex catalysis, such as in 
directing post-transcriptional regulation of gene expression or in guiding RNA modifications. 

One goal of genome projects is to systematically
identify genes 1 . In the past year, two papers have
announced drafts of the human genome sequence 2,3 ,
but the estimated number of human genes continues
to fluctuate. Current estimates centre on
30,000-40,000 genes, with occasional excursions to
100,000 or more 4-6 . One reason for the continuing
ambiguity is that genes are neither well defined nor
easily recognizable. The numerology is based on
three methods: cDNA cloning and expressed
sequence tag (EST) sequencing of polyadenylated
mRNAs 7,8 ; identification of conserved coding exons
by comparative genome analysis 9 ; and computational
gene prediction 2,3 . These methods work best for large,
highly expressed, evolutionarily conserved
protein-coding genes, and they almost certainly
underestimate the number of other genes. They essentially do
not work at all for one class of genes: the non-coding
RNA (ncRNA) genes, which produce transcripts
that function directly as structural, catalytic or
regulatory RNAs, rather than expressing mRNAs that
encode proteins 10-12 (see BOX 1 for a list of
abbreviations that are used to describe classes of RNA).
Knowledge of ncRNAs has been limited to biochemically
abundant species and anecdotal discoveries.
Even after the completion of many genome
sequences, both the number and diversity of ncRNA
genes remain largely unknown.

Could it be possible that a large class of genes has
gone relatively undetected because they do not make
proteins? How many ncRNA genes are there? How 
important are they? What functions does a cell delegate 
to RNA instead of protein, and why? 

To address these questions, new systematic
gene-discovery approaches need to be developed that are
specifically aimed at ncRNAs. A pioneering study by
Roy Parker's group found a few new RNA genes and
small open reading frames (ORFs) in the yeast genome
by doing northern blots that probed for expressed
transcripts in 'grey holes' (suspiciously large intergenic
regions), and by searching for consensus RNA
polymerase III promoters 13 . Recently, several groups have
carried out systematic ncRNA gene-identification
screens along three main lines: cDNA cloning and
sequencing tailored to find new small non-mRNAs 14 ;
specially designed cDNA cloning screens for a new
regulatory RNA gene family of tiny RNAs called
microRNAs (miRNAs) 15-17 ; and general ncRNA
gene-finding exercises using computational comparative
genomics in Escherichia coli 18-20 . The results of these
screens are startling. All of them indicate that the
prevalence of ncRNA genes has indeed been underestimated.

The idea that a class of genes might have remained
essentially undetected is provocative, if not heretical. It is
perhaps worth beginning with some historical context
of how ncRNAs have so far been discovered. Gene
discovery has been biased towards mRNAs and proteins
for a long time.

Box 1 | Abbreviations for different classes of non-coding RNA

• fRNA
Functional RNA: essentially synonymous with non-coding RNA

• miRNA
MicroRNA: putative translational regulatory gene family

• ncRNA
Non-coding RNA: all RNAs other than mRNA

• rRNA
Ribosomal RNA

• siRNA
Small interfering RNA: active molecules in RNA interference

• snRNA
Small nuclear RNA: includes spliceosomal RNAs

• snmRNA
Small non-mRNA: essentially synonymous with small ncRNAs 14

• snoRNA
Small nucleolar RNA: most known snoRNAs are involved in rRNA modification

• stRNA
Small temporal RNA: for example, lin-4 and let-7 in Caenorhabditis elegans

• tRNA
Transfer RNA



The lessons of history 

The central role of RNA in translation. It was clear by
the 1950s that although DNA was located in the
eukaryotic nucleus, proteins were being synthesized in the
cytoplasm in the presence of abundant RNA 21,22 . Most
of this cellular RNA could be found in discrete particles
in the cytoplasm 23 , which were later shown to be the site
of protein synthesis and called ribosomes 24 . James
Watson sketched the "central dogma" as early as 1952
(refs 25,26), imagining that there must be a coding RNA
that is passed from the DNA to the protein synthetic
machinery in the cytoplasm. The prevailing theory was
the now-forgotten "one gene, one ribosome, one
protein" hypothesis 24,27 that each gene produced a
specialized ribosome composed of a specific mRNA that was
associated with general ribosomal proteins that
catalysed translation. Various results undermined this
hypothesis, including the simple observation that
although genes came in a great variety of sizes and base
compositions, ribosomal RNAs had no variety 27 . Finally,
ribosomes were found to be general-purpose
RNA/protein machines, composed largely of stable rRNAs 28 , and
programmed with various unstable mRNAs that are
only a small fraction of the total RNA population 27-29 .

The second class of functional RNA was predicted by
Francis Crick's "adaptor" hypothesis 24 . Crick predicted
the existence of a molecule that mediates between the
triplet genetic code and the encoded amino acid.
Interestingly, Crick argued not only that the adaptor
would be an RNA, but also that RNA would be
evolutionarily preferred over protein as the material for his
adaptors, because base pairing made RNA uniquely
suited for a role as a small, specific RNA recognition
molecule 24 . Crick's adaptors had in fact just been
biochemically observed by Mahlon Hoagland and co-workers 30 .
These RNAs later proved to be Crick's adaptors: the
transfer RNAs 31 .



U RNA 

Small nuclear RNA in 
eukaryotes. The first such RNAs 
to be found were rich in uridine 
(U), and the name stuck. 

NUCLEOLUS 

A highly organized nuclear 
organelle that is the site of 
ribosomal RNA processing and 
ribosome assembly. 

RIBONUCLEASE P
A universally conserved enzyme 
that cleaves a leader sequence 
from tRNA precursors. 

SIGNAL RECOGNITION 
PARTICLE 

An RNA-protein complex 
involved in exporting secreted 
proteins from the cell. 



RNA therefore changed from being thought of as 
having one 'flavour' (the purely information-carrying
intermediate in the "central dogma") to having three 
flavours, all apparently involved in making protein: 
rRNA, tRNA, and everything else, which was 
assumed to be mRNA. Genetics and enzyme bio- 
chemistry had already shown links between mutant 
genes, missing enzymatic activities and missing or 
altered proteins. The central intellectual problem was 
to solve the genetic code. The non-rRNA/non-tRNA 
fraction was complex, non-abundant and mostly 
unstable, and there was little motivation or ability to 
go any further and ask whether it contained more 
than mRNA. 

RNA comes in more than three flavours. Several
abundant, small non-mRNAs, other than rRNA and tRNA,
were detected and isolated biochemically, among them
the uridine (U)-rich U RNAs 32,33 . Many of these small
RNAs are associated with proteins to form
ribonucleoprotein (RNP) complexes 34 . Characterization of small
RNPs was aided by the discovery that certain patients
with autoimmune diseases, such as systemic lupus
erythematosus, produce anti-RNP autoantibodies that
could be used to immunoprecipitate small RNPs 35 .
Many of the abundant small RNPs precipitated by
these antisera, namely U1, U2, U4, U5 and U6 small
nuclear RNA (snRNA), turned out to be components
of the spliceosome, involved in splicing mRNAs 34-36 .
Other U RNAs (U4atac, U6atac, U11 and U12)
have been found to be components of a second
spliceosome species 37,38 .

Many other small RNAs have been isolated
biochemically. Sometimes these isolations have been
deliberate, such as the isolation of numerous small
nucleolar RNAs (snoRNAs) from nucleoli 39 . In other cases,
biochemical fractions were unexpectedly found to
contain essential RNAs, as in the case of ribonuclease P 40 .
One of the best examples of such a surprise resulted in
the renaming of the signal recognition 'protein' to the
signal recognition 'particle' (SRP), when it was
unexpectedly found to contain a 7S RNA that is now called
SRP-RNA 41,42 .

New RNAs continue to appear; among the more
fascinating stories is the discovery that RNAs have
roles in chromatin structure 43 . A canonical example is
the human XIST (X (inactive)-specific transcript)
RNA, a 17-kb ncRNA with a key role in dosage
compensation and X-chromosome inactivation 44 .
Drosophila melanogaster also seems to control dosage
compensation using small chromatin-associated roX
(RNA on the X) RNAs 45 . Several large ncRNAs have
been found to be expressed from imprinted regions of
vertebrate chromosomes, including the IPW
(imprinted in Prader-Willi syndrome) and H19
(H19, imprinted maternally expressed untranslated
mRNA) transcripts 46,47 . (The imprinted Prader-Willi
crucial region seems to be especially rich in
ncRNAs 48,49 , although it is unclear whether this is
peculiar, or simply due to the incredibly intense gene
hunting in search of the elusive cause of

RNA PROCESSING
A general term for the 
maturation of a precursor RNA; 
includes the processes of RNA 
splicing, RNA modification, 
RNA editing and RNA cleavage. 

CAJAL BODIES 

(also known as coiled bodies). 
Nuclear organelles of unknown 
function, named in honour of 
Ramón y Cajal.

RNA TAILING 
A technique in which an 
artificial homopolymer 
sequence is enzymatically 
added to an RNA to facilitate 
molecular cloning, as opposed 
to relying on the presence of a 
natural poly-A tail. 

Figure 1 | Diagrams of snoRNAs guiding modification to
target rRNA bases. a | C/D box small nucleolar RNAs
(snoRNAs) use antisense complementarity to target RNA for
2'-O-ribose methylation (site marked with 'm' and red dot). R
stands for A or G (purine). b | H/ACA box snoRNAs use
antisense complementarity in an interior loop to target RNA
for pseudouridylation (site marked 'Ψ'). Redrawn with
permission from REF. 78.



Prader-Willi.) Many of these other RNAs are
cis-antisense RNAs that overlap coding genes on the other
genomic strand. Various cis-antisense RNAs have been
observed in prokaryotes 50 , plants 51 and animals 12 , and
their roles are unlikely to be limited to those in
imprinting and chromatin structure. Mutations in
one cis-antisense RNA in humans, SCA8
(spinocerebellar ataxia 8), are found in patients with
spinocerebellar ataxia 52 .

Continued flurries of small nucleolar RNAs 

The nucleolus is rich in snoRNAs, most of which are
~70-250 nucleotides in length 53,54 . Some snoRNAs
have roles in ribosomal RNA processing, but most
function in rRNA modification 39 . On the basis of weak
sequence similarities, almost all snoRNAs fall into two
families: the 'C/D box' snoRNAs and the 'H/ACA'
snoRNAs 39,55 . The C/D box snoRNAs use base
complementarity to guide site-specific 2'-O-ribose
methylations to rRNA 56-58 , whereas the H/ACA snoRNAs
use base complementarity to guide site-specific
pseudouridylations to rRNA 59,60 (FIG. 1). In both cases,
the catalytic function seems to be provided by a
protein methylase or pseudo-U synthetase associated
with the snoRNA, and the specificity for the target
base on the rRNA is provided by base
complementarity to the snoRNA 61-63 .
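
The targeting principle described here, in which a short guide element finds its site purely by antisense base pairing, can be illustrated with a minimal sketch. The function names, the toy sequences and the requirement for a perfect, ungapped Watson-Crick match are assumptions made for illustration; real snoRNA target prediction also relies on the conserved box motifs and tolerates some non-canonical pairing.

```python
def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A-U and G-C pairs only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def find_antisense_sites(guide: str, target: str) -> list[int]:
    """Return 0-based start positions in `target` that pair perfectly with `guide`.

    A guide pairs with a target site when the site equals the guide's
    reverse complement (antiparallel Watson-Crick pairing).
    """
    site = reverse_complement(guide)
    hits = []
    start = target.find(site)
    while start != -1:
        hits.append(start)
        start = target.find(site, start + 1)
    return hits


# Hypothetical toy sequences, not real snoRNA or rRNA sequences.
guide_element = "UCAGGCUAAC"
rrna_fragment = "GGAUGUUAGCCUGAACGC"
print(find_antisense_sites(guide_element, rrna_fragment))
```

In this picture the RNA contributes only the address; the chemistry is left to the associated protein, which is the division of labour the text describes.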



For many eukaryotes, the approximate number of
specific 2'-O-ribose methylations and
pseudouridylations is known, and for some species, many modified
positions have been precisely mapped 64-66 . In human
rRNAs, for instance, there are ~100-110 of each type of
modification, and in yeast, about 50 of each. If
snoRNAs direct most (or all) eukaryotic nuclear rRNA
2'-O-ribose methylations and pseudouridylations,
there must be a large number of undiscovered
snoRNAs. Indeed, computational screens have revealed
41 new C/D snoRNAs in the yeast genome 67 and more
than 60 new C/D snoRNAs in the Arabidopsis thaliana
genome 68,69 . Immunoprecipitation with antibodies
against fibrillarin (the putative methyltransferase)
revealed 17 new C/D snoRNAs in Trypanosoma
brucei 70 , and cDNA sequencing has found 72 new C/D
snoRNAs and 41 new H/ACA snoRNAs in the mouse
(see below) 14 . Numerous homologues of the C/D
snoRNAs have been found in the Archaea 71,72 , in which
they are presumed to have the same function in guiding
specific 2'-O-ribose methylations of target RNAs.

In addition to rRNA, other structural RNAs,
such as tRNAs and snRNAs, are known to be
extensively modified 33,73,74 , and it now seems that
some, if not many, of these modifications are also
guided by snoRNAs. At least one of the 2'-O-ribose
methylations of Xenopus laevis U6 snRNA is guided
by the C/D snoRNA mgU6-77 (REF. 73). Human U85
is a chimeric C/D, H/ACA snoRNA (a 'Siamese'
snoRNA) that guides both a methylation and a
pseudouridylation of U5 snRNA 75 . Furthermore,
sequencing of snoRNA-enriched cDNA libraries has
revealed several 'orphan' snoRNAs with no obvious
rRNA target 49,76,77 , as have the computational screens
for archaeal snoRNAs 71 , for which a few such cases
have been putatively assigned to known tRNA
2'-O-ribose methylations. One puzzling aspect of these
discoveries is that one has to wonder how non-rRNAs
are transported through the nucleolus, or whether
perhaps there is at least one more site of RNA
modification in the cell. Recent evidence indicates that the
snRNA modifications are associated with Cajal bodies
(coiled bodies) in the nucleus 78 .

An EST screen for small non-mRNAs. Alexander
Hüttenhofer and colleagues 14 undertook a general
screen for new, small non-mRNAs, using an EST
sequencing approach. The RNA population used was
total (not cytoplasmic) mouse brain RNA that was
cloned by RNA tailing (not by poly-A selection and dT
priming) and size selected in two small RNA
fractions: 50-100 nucleotides and 110-500
nucleotides. High-throughput filter hybridization
was used to screen out clones that correspond to
tRNA, rRNA fragments and other known ncRNAs,
increasing the fraction of new ncRNA sequences
from ~3-7% in an unscreened library to ~20-22%
after screening. A total of ~5,000 clones were
sequenced, and after accounting for several sequences
of the same RNA species, 201 new RNA sequences
were identified.

Figure 2 | Examples of proposed interactions between the Caenorhabditis elegans lin-4 microRNA and a target
mRNA. lineage-4 (lin-4) is proposed to interact by base pairing with a,b | seven sites in the 3' untranslated region (UTR) of lin-14
mRNA (first two of the seven sites are shown) 129 and c | one site in the 3' UTR of lin-28 mRNA 84 . A C residue (in red) is predicted
to be bulged in four out of the seven lin-14 interactions, including the two shown; this C is mutated to U in the strong
loss-of-function lin-4 ma161 allele 82 .



HETEROCHRONIC MUTATION 
A mutation that alters the 
timing of developmental 
events, such as the sequence of 
larval moults in nematodes. 



A few more than half of the new sequences seem to
be new snoRNAs: 72 new C/D snoRNAs and 41 new
H/ACA snoRNAs. Of these, several are orphans that do
not have obvious rRNA or snRNA targets. Some of
these snoRNAs showed brain-specific expression,
which would not be predicted for molecules involved
in ubiquitous rRNA modification. The human
homologues of three of these snoRNA genes mapped to the
crucial region for Prader-Willi syndrome, two of which
(HBII-52 and HBII-85) are C/D snoRNAs found in
multicopy tandem arrays, unlike most vertebrate
snoRNAs, which are found in single copies in the
introns of other genes. Both HBII-52 and HBII-85 are
expressed as imprinted genes only from the paternal
chromosome, as expected for a Prader-Willi candidate
gene. The HBII-85 array, located just to the left of
(centromeric to) the non-coding IPW gene, was also
detected as an imprinted ncRNA gene array by other
studies 48,79 . The HBII-52 snoRNA has a perfect 18-bp
complementarity to 5-hydroxytryptamine 2C (5-HT2C)
receptor mRNA, and is predicted on that basis to
methylate a site of known mRNA editing; this indicates
a complex set of interactions in which a snoRNA might
regulate the editing of an mRNA transcript 14,80 .

Out of the 88 sequences that did not seem to be
snoRNAs, 20 that did not correspond to known mRNAs
or repetitive elements were confirmed as expressed,
small, discrete RNAs by northern blots, with sizes
ranging from 65 to 500 nucleotides. The functions of these
20 new small RNAs are unknown. Hüttenhofer and
co-workers are now analysing similar libraries from
Caenorhabditis elegans, D. melanogaster and A. thaliana.

MicroRNAs: one, two ... infinity?

One: lin-4. A canonical example of the identification of
a ncRNA gene by genetics is the story of the lin-4
regulatory RNA in the nematode C. elegans. The lin-4 locus
was identified in a screen for mutations that affect the
timing and sequence of postembryonic development
(heterochronic mutations) in C. elegans 81 . Mutant
animals reiterate the L1 larval stage rather than progress to
later stages of development. The gene was positionally
cloned by isolating a 693-bp DNA fragment that could
rescue the phenotype of mutant animals 82 . The paper by
Rosalind Lee and colleagues dryly recounts a careful
detective story, as Victor Ambros's lab gradually realized
that they were dealing not with a protein-coding gene,
but with a tiny ncRNA. The lin-4 gene product is a
22-nucleotide RNA, processed from a 61-nucleotide
precursor RNA with a putative stem-loop structure.

Genetically, lin-4 acts as a negative regulator of
heterochronic protein-coding genes such as lin-14 and
lin-28. The 3' untranslated regions (UTRs) of the target
genes have short stretches of complementarity to lin-4
(refs 82-84; FIG. 2). Deletion of these apparent lin-4 target
sequences causes an unregulated gain-of-function
phenotype 83,84 . The lin-4 RNA inhibits accumulation of the
LIN-14 and LIN-28 proteins by an unknown
mechanism. The target mRNA remains stable, fully
polyadenylated and polysome associated 85 .

Two: let-7. The lin-4 gene remained an oddity until a
second heterochronic gene, lethal-7 (let-7), also mapped
to a ncRNA gene with a 21-nucleotide product 86 . The
small let-7 RNA is also thought to be a
post-transcriptional negative regulator, possibly targeting the
protein-coding mRNAs for lin-41 and lin-42, based on
phenotypic analysis and plausible complementary sequences
in the 3' UTR of these genes.

Surprisingly, Amy Pasquinelli et al. 87 showed that let-7
was almost 100% conserved and expressed as a small
21-nucleotide RNA in all bilaterally symmetrical animals
that were tested, including human, mouse, chicken,
polychaete worms and flies, but not in cnidarians (jellyfish)
or poriferans (sponges). The function of these let-7
homologues is unknown, but because they show
temporal regulation that is generally similar to the
developmental pattern of let-7 in the worm, one presumes that
they also function in post-transcriptional regulation of
developmental genes. Pasquinelli et al. proposed the
name "small temporal RNAs" (stRNAs) for genes such
as lin-4 and let-7, and suggested that others might
be found.

A surprising link to RNA interference. Meanwhile, the
increasingly baroque phenomenology of
double-stranded RNA interference (RNAi) was being
elucidated 88-91 . The introduction of exogenous
double-stranded RNA (dsRNA) into nematodes, by direct injection
or even by feeding, leads to the specific, rapid
degradation of homologous mRNA(s), and a loss-of-function
phenotype. RNAi also works in many other organisms,
including plants, in which the effect has been called
co-suppression or post-transcriptional gene silencing 89,91 .

Figure 3 | Three examples of microRNAs. Proposed
structure of the precursor stem is shown, with residues in the
mature microRNA (miRNA) shown in red. Comparison of
Caenorhabditis elegans miR-1 (REFS 15,16) with Drosophila
melanogaster miR-1 (REF. 17) shows perfect conservation of
the mature miRNA (except for length variability at the 3' end).
Comparison of miR-1 with miR-84 (REF. 16) shows an
example of how mature miRNAs are produced
asymmetrically from either side of the precursor stem.



The input dsRNA is cleaved to form the active agents of
the RNAi effect: tiny 21-25-nucleotide small
interfering RNAs (siRNAs) 92-94 . Several proteins that are
important in the RNAi pathway have been identified,
including the putative processing nuclease Dicer and a
large family of homologous proteins including
Caenorhabditis RDE-1, Arabidopsis ARGONAUTE and
Drosophila Piwi. RNAi has been suggested to function
as a primitive immune system against RNA viruses and
retrotransposons 90,91 .

Many people noted with suspicion that the sizes of
the active lin-4 and let-7 stRNAs (22 and 21
nucleotides, respectively) are the same as those of the
siRNAs 87,90,95 . Indeed, the RNAi-processing pathway
shares components with the stRNA-processing
pathway. Knocking down Dicer function in human
cultured cells leads to accumulation of the 72-nucleotide
unprocessed human let-7 precursor 93 . Knocking down
either the function of the C. elegans Dicer homologue
or two of the 23 worm homologues of the
rde-1/ARGONAUTE/piwi gene family, alg-1
(argonaute-like gene 1) and alg-2, results in accumulation of
unprocessed lin-4 and let-7 precursors 96 . In the course
of cloning and analysing the small RNAs produced
from an exogenous dsRNA, Thomas Tuschl's lab noted
in passing that Drosophila seemed to contain
endogenous 21- and 22-mers 94 , and suggested that perhaps
there were naturally occurring siRNAs.

Introducing the microRNAs. Now, three papers show
that, indeed, lin-4 and let-7 are not alone: they
belong to a potentially very large family of small RNAs
in nematode, fly and human (and presumably other
organisms) that are being called the miRNAs. Nelson
Lau et al. 16 produced and sequenced a C. elegans cDNA
library that was cleverly enriched for tiny RNAs with
5'-phosphate and 3'-hydroxy termini, and obtained 55
new miRNAs. Lee and Ambros used a size-selected
C. elegans cDNA library and, to a lesser extent, a
computational approach, to look for conserved sequences
in Caenorhabditis briggsae that can be folded into a
stem similar to the lin-4 and let-7 precursors, and
found 15 miRNAs 15 . Mariana Lagos-Quintana et al. 17
used size-selected cDNA libraries in human and
Drosophila to isolate 33 miRNAs: 19 in humans and
14 in Drosophila.

In total, 91 different miRNAs have been identified so
far in the three species. Some of these are highly
conserved in evolution, such as let-7, and homologues of 11
miRNAs are found in more than one of the three
species. Northern blot analyses have been done for
many of these RNAs, and generally show both a
21-24-nucleotide form (presumably the active miRNA) and,
often, a less abundant ~70-nucleotide form
(presumably the precursor stem-loop). The miRNA genes are
often clustered in the genome 16,17 and might be
co-expressed in polycistronic precursor transcripts. Many
of the miRNAs were identified by single cDNA
sequences, so it is clear that none of these screens are
near saturation.

It seems that miRNAs are more likely to function as 
translational repressors like lin-4, not as siRNAs in 
directing mRNA degradation. Like lin-4, but unlike 
siRNAs, miRNAs are produced asymmetrically from 
the precursor stem and, almost invariably, only one 
strand of the precursor stem can be recovered as a 
21-24-nucleotide product, although the Bartel lab 
reports a single exception 16 (FIG. 3). Many miRNAs are 
produced in a stage- and/or tissue-specific manner, 
indicating possible roles in development akin to the 
stRNAs. Some of the C. elegans miRNAs are specifically 
expressed in the germ line and embryo, in which trans- 
lational regulation is particularly prevalent 16 . If the par- 
allels with lin-4 hold up, the miRNAs should be expect- 
ed to direct translational repression by binding to one 
or more sites with imperfect complementarity in the 
3' UTRs of coding mRNAs (FIG. 2).
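
A search for sites of imperfect complementarity in a 3' UTR can be sketched as follows. This is a deliberately simple illustration under stated assumptions: the function names, the toy sequences and the `min_pairs` threshold are hypothetical, only ungapped windows are scored, G-U pairs are not counted, and bulged residues of the kind proposed for the lin-4/lin-14 interactions are not modelled.

```python
def count_pairs(mirna: str, site: str) -> int:
    """Count Watson-Crick pairs between a miRNA and an equal-length mRNA site.

    Both sequences are written 5'->3'. Pairing is antiparallel, so miRNA
    position i is compared with site position len - 1 - i.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    return sum((m, s) in pairs for m, s in zip(mirna, reversed(site)))


def scan_utr(mirna: str, utr: str, min_pairs: int) -> list[tuple[int, int]]:
    """Return (start, paired_bases) for every ungapped UTR window with >= min_pairs pairs."""
    width = len(mirna)
    hits = []
    for start in range(len(utr) - width + 1):
        paired = count_pairs(mirna, utr[start:start + width])
        if paired >= min_pairs:
            hits.append((start, paired))
    return hits


# Hypothetical toy sequences: the UTR carries one perfect site (start 2)
# and one site with a single mismatch (start 12).
toy_mirna = "AUGGCUCA"
toy_utr = "CCUGAGCCAUGGUGAGGCAUAA"
for start, paired in scan_utr(toy_mirna, toy_utr, min_pairs=7):
    print(start, paired)
```

Lowering `min_pairs` quickly increases the number of chance matches for a sequence this short, which is one reason why complementarity alone is weak evidence for a regulatory target.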

Another puzzling observation about RNAi now 
seems to make more sense. Some of the genes impli- 
cated in the RNAi-processing pathway have lethal 
phenotypes or show developmental defects when 
knocked out, which does not make sense if they are 
functioning solely in RNAi and as an anti-virus 

Figure 4 | Example of an Escherichia coli riboregulator. a | The dsrA RNA is proposed to have a three-stem secondary
structure on its own, but b | unzips to interact by base pairing with the translational start site of several different mRNAs, including
rpoS (the Shine/Dalgarno and AUG of the start site are boxed). In some mRNAs, DsrA blocks the ribosome-binding site and acts
as a translational repressor. For rpoS, DsrA acts as a translational activator, which is proposed to happen by competition with an
occluding secondary structure as shown in b; base pairs that would be broken by interaction with bases in DsrA (shown in red),
freeing the start site, are shown with bold, open circles. nt, nucleotide. Redrawn with permission from REF. 99.



SHINE/DALGARNO SEQUENCE 
A consensus sequence 
recognized during translational
initiation by Escherichia coli
ribosomes. 



defence mechanism 88,90,91 . In C. elegans, knockdowns 
of some genes in the rde-1/ARGONAUTE/piwi gene
family, such as rde-1 itself, produce RNAi-defective
phenotypes but no developmental phenotypes,
whereas knockdowns of others, such as alg-1 and
alg-2, produce developmental phenotypes but are still
RNAi sensitive 96 . Therefore, it seems that in addition 
to the RNAi effect itself, components of the RNAi- 
processing pathway also function in developmental 
regulatory processes that might involve numerous 
endogenous miRNAs. 

Even E. coli has been hiding something

The bacterium E. coli is arguably the best-studied
organism. The complete genome sequence of E. coli
K-12 contains an estimated 4,200 protein-coding
genes 97 . The small number of known ncRNA genes has
continued to climb slowly, as several small, stable RNAs
have been reported anecdotally 98 . Many of these seem to
be 'riboregulators' 99,100 , which act by using base
complementarity to specifically interact with translational start
sites and either repress or, more rarely, activate
translation (FIG. 4). The recent availability of comparative
genome sequence information from Salmonella spp.
and other enterobacteria made it possible to go looking
for conserved sequences that might correspond to
ncRNAs, instead of coding ORFs.

Liron Argaman and colleagues computationally 
analysed intergenic regions to identify loci that have a 
predicted promoter and terminator spaced 50-400 
nucleotides apart, and are significantly conserved in 
other bacterial genomes 18 . They predicted 24 candidate 
ncRNA genes, 14 of which were shown by northern 
blot analysis to produce discrete, small transcripts rang- 
ing from 70 to 250 nucleotides in size. Many of the 
RNAs were expressed in conditions other than 'normal'
exponential growth in rich media: four RNAs were only
expressed in stationary phase; five were preferentially 
expressed in stationary phase; one was expressed 
almost exclusively in cold-shocked cells; and one was 
preferentially expressed in minimal media. 
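
The spacing criterion used in this screen can be sketched as a simple filter. The sketch below assumes that promoter and terminator predictions are already available as coordinates on one strand of an intergenic region; the `Signal` class, the `candidate_loci` function and the example coordinates are hypothetical, and the conservation test that the authors also applied is omitted.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str      # "promoter" or "terminator"
    position: int  # coordinate along the strand, in nucleotides


def candidate_loci(signals, min_len=50, max_len=400):
    """Pair each promoter with every downstream terminator min_len..max_len nt away."""
    promoters = sorted(s.position for s in signals if s.kind == "promoter")
    terminators = sorted(s.position for s in signals if s.kind == "terminator")
    loci = []
    for p in promoters:
        for t in terminators:
            if min_len <= t - p <= max_len:
                loci.append((p, t))
    return loci


# Hypothetical predictions in intergenic sequence.
predicted = [
    Signal("promoter", 1000),
    Signal("terminator", 1030),   # 30 nt downstream: too close
    Signal("terminator", 1120),   # 120 nt downstream: within range
    Signal("promoter", 5000),
    Signal("terminator", 5600),   # 600 nt downstream: too far
]
print(candidate_loci(predicted))
```

A filter of this kind produces candidates only; as the northern blot results show, each locus still has to be tested for expression, and often in more than one growth condition.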

Karen Wassarman et al. 20 looked for intergenic regions
that showed conservation to other genomes. They ranked
these candidate regions using further criteria, such as the
separation of the conserved region from nearby ORFs,
the presence of plausible promoter and terminator
signals, and significant RNA expression detected on
whole-genome high-density oligo-probe arrays. They predicted
59 candidate ncRNA loci, 17 of which were shown by
northern blot analysis to produce discrete, small RNA
transcripts ranging from 45 to 320 nucleotides in size.
Again, several of these were shown to be expressed almost
exclusively in stationary phase cells. Seven of these RNAs
could be immunoprecipitated with antisera to the
abundant RNA-binding protein Hfq, which binds two
previously known ncRNAs in E. coli: oxyS and dsrA.

Elena Rivas and co-workers have developed a general
ncRNA gene-finding algorithm 101 . The algorithm uses
comparative genome sequence analysis to detect
conserved sequence regions in which the pattern of
mutation is more consistent with conservation of a
base-paired secondary structure than with conservation of an
amino-acid coding sequence or with a null hypothesis
of uncorrelated position-independent mutation. The
approach is therefore limited to detecting only ncRNAs
with conserved intramolecular secondary structure. The
algorithm was used to screen the E. coli genome, and it
detected 275 candidate loci 19 . A sample of 49 of these
loci was analysed by northern blot analysis in a single
growth condition (exponential growth in rich media),
and 11 were found to produce small RNAs, ranging
from 40 to 370 nucleotides in size.
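
The idea that different kinds of conserved sequence leave different mutation patterns can be illustrated with a toy tally. This is not the algorithm of Rivas and co-workers; it is a minimal sketch that assumes an ungapped pairwise alignment and a list of base-paired positions supplied by the user (both hypothetical inputs), and that simply counts compensatory changes in paired positions alongside substitutions at third codon positions in one assumed reading frame.

```python
# Pairs treated as compatible with an RNA helix (Watson-Crick plus G-U wobble).
HELIX_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def mutation_pattern(seq1, seq2, pairs):
    """Summarize substitutions in an ungapped pairwise alignment.

    pairs: list of (i, j) positions assumed to be base-paired (hypothetical input).
    Returns the number of substitutions, the number of paired positions where both
    partners changed but pairing is preserved, and the number of substitutions
    that fall on third codon positions in frame 0.
    """
    assert len(seq1) == len(seq2)
    diffs = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
    compensatory = sum(
        1
        for i, j in pairs
        if seq1[i] != seq2[i]
        and seq1[j] != seq2[j]
        and (seq1[i], seq1[j]) in HELIX_PAIRS
        and (seq2[i], seq2[j]) in HELIX_PAIRS
    )
    third = sum(1 for i in diffs if i % 3 == 2)
    return {
        "substitutions": len(diffs),
        "compensatory_pairs": compensatory,
        "third_position": third,
    }


# Hypothetical toy hairpin: a 4-bp stem closing a 4-nt loop, aligned to a
# homologue in which two of the four base pairs have changed on both sides.
stem_pairs = [(0, 11), (1, 10), (2, 9), (3, 8)]
print(mutation_pattern("GGCAUUCGUGCC", "GACCUUCGGGUC", stem_pairs))
```

In the toy alignment all four substitutions occur as two compensatory pair changes, the signature expected of a conserved helix, whereas a coding region would tend to concentrate its substitutions at third codon positions and a null region would show no such correlation.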

In total, three systematic screens have identified 34
new ncRNA transcripts in E. coli 18-20 of as yet unknown
function. There is little overlap in the confirmed transcripts
(only eight were confirmed by more than one of
the screens). This indicates that these screens have not
saturated the E. coli genome for new ncRNAs. Of the 27
genes confirmed by one or both of the screens carried
out by Argaman et al. and Wassarman et al., 21 are in
the candidate list proposed by Rivas and colleagues,
indicating that the sensitivity of the computational gene
finder is fairly high. The experimental characterization
done by Argaman et al. and Wassarman et al. shows that
many ncRNAs are expressed in specific growth
conditions, something that had already been seen for
known E. coli ncRNAs; for instance, for the oxyS RNA
(expressed in oxidatively stressed cells) 102 or the csrB
RNA (expressed in stationary-phase cells) 103 . This indicates
that the examination of a single growth condition
by Rivas et al. was insufficient, and shows that confirming
the expression of a candidate ncRNA gene is not
necessarily straightforward.
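
The figures quoted above can be put side by side with a little arithmetic. The sketch below uses only counts given in the text; the variable names are illustrative, and the northern blot fraction should be read as a lower bound, because only one growth condition was tested.

```python
# Counts taken from the text; names are illustrative.
confirmed_by_experimental_screens = 27  # genes confirmed by Argaman et al. and/or Wassarman et al.
also_in_computational_list = 21         # of those, present among the 275 candidate loci

candidates_tested_by_northern = 49      # sample of candidate loci tested
northern_positive = 11                  # produced a detectable small RNA

# Sensitivity of the computational screen, measured against the experimentally confirmed set.
sensitivity = also_in_computational_list / confirmed_by_experimental_screens

# Fraction of tested candidates confirmed in a single growth condition (a lower bound).
northern_hit_rate = northern_positive / candidates_tested_by_northern

print(f"sensitivity = {sensitivity:.1%}")
print(f"northern hit rate = {northern_hit_rate:.1%}")
```

The gap between the two fractions is what the text attributes to condition-specific expression rather than to false predictions.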

How many new ncRNAs is E. coli still hiding?
Simulation studies of the false-positive rate in the study
by Rivas et al. indicate that 200 or more of the 275 gene
predictions should be real ncRNAs (or more precisely,
biologically relevant sequences conserving an RNA
structure; the approach cannot easily distinguish cis-regulatory
RNA structures from independent ncRNA
genes) 19 . Wassarman and colleagues proposed that it
would be unlikely that more than 50 new ncRNAs
would be found in E. coli 20 . A fourth screen, using a
single-sequence neural-net-based computational
gene-finding approach in E. coli, predicted 370 sequence
windows to be ncRNA genes (because the windows could
overlap, this means a somewhat smaller number of
RNA gene loci) 104 . These predictions have yet to be
experimentally verified, and the amount of overlap with
the other screens needs to be examined.

Matters arising 

Many genes, little genetics. On the one hand, we have 
genomic screens that are unsaturated and must provide 
just a taste of larger numbers of ncRNA genes to come. 
On the other hand, if there were many ncRNA genes, 
one would think that they should have been detected 
sooner in classical genetic screens. (Most biochemical, 
computational and molecular biology gene-discovery 
approaches make strong assumptions about finding 
proteins, ORFs and mRNA, so it is easier to rationalize 
their failure to detect ncRNAs.) There are some biases 
even in classical genetics, though. Strikingly, few of the 
known ncRNA genes have been identified by genetics. 
For example, none of the known E. coli small RNAs
have been identified by mutational screens 20,98 . RNA 
genes are immune to frameshift or nonsense mutations, 
and are often small and multicopy, which makes them 
difficult (even impossible) targets for recessive muta- 
tional screens. 

An interesting extra source of bias is introduced in
going from a mapped genetic locus to a cloned gene.
Especially in more complex systems, candidate gene
identification is often essential for pinpointing a mutant
locus, but candidate gene identification is biased
towards ORFs and coding genes. Maaret Ridanpää and
colleagues recently provided an excellent example in
human genetics. Cartilage-hair hypoplasia (CHH), a
short-limbed dwarfism, was first described by Victor
McKusick almost 30 years ago 105 . Positional cloning
failed to identify the gene despite straightforward and
accurate genetic mapping 106,107 . Ridanpää et al. finally
increased the resolution of the genetic map by almost an
order of magnitude and sequenced the entire human
genomic region. All ten identifiable protein-coding
genes were studied, with no luck. CHH-associated
mutations were at last discovered in the 267-nucleotide
RMRP ncRNA gene, which produces the essential RNA
component of the ribonucleoprotein endoribonuclease
MRP (MRP stands for mitochondrial RNA processing) 108 .
The only reason RMRP was considered as a candidate
gene was that human MRP RNA had previously
been isolated biochemically 108 and its sequence was in
GenBank. Otherwise, Ridanpää and co-workers might
still be looking.
One other human genetic disorder has been mapped 
to a nuclear-encoded ncRNA candidate gene by posi- 
tional cloning — autosomal-dominant dyskeratosis 
congenita patients have mutations in telomerase 
RNA 109 . Here again, telomerase RNA was already in the 
GenBank database and, moreover, it was an obvious 
candidate gene, as an X-linked dyskeratosis had already
been associated with dyskerin — a protein known to 
interact with telomerase RNA. 

The power of comparative analysis. It is difficult to dis- 
tinguish coding genes with short ORFs from ncRNA 
genes. Many sequences have long ORFs and are obvi- 
ously coding, but for others, coding potential is less con- 
vincing. Protein-coding regions as small as seven amino 
acids in length are known 110 . ORFs greater than 100 
amino acids can occur just by chance in completely ran- 
dom sequence; it has been argued that 10-15% of anno- 
tated ORFs in microbial genomes are in fact spurious 111 . 
ORF length and 'coding potential' alone are therefore
often insufficient to decide whether a gene is coding or
non-coding. Errors are being made in both directions.
The 360-nucleotide bacterial regulatory ncRNA CsrB 103
was originally misannotated as a 47-amino-acid protein,
because that was the ORF closest to several
mapped mutations 112 ; the erroneous 'protein' sequence
is still in GenBank (Erwinia carotovora aepH,
AAB32243.1). Conversely, the plant (Medicago)
ENOD40 gene was first thought to be an ncRNA gene
on the basis of sequence analysis that showed "no significant
coding potential" 113 . Now, on the basis of comparative
genome analysis and more detailed, directed mutagenesis
studies, the ENOD40 transcript seems to encode
two tiny proteins that are 13 and 27 amino acids long 114 .
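
How readily long ORFs arise by chance can be checked with a short calculation. The sketch below assumes a random sequence of independent bases with equal A and T frequencies and equal G and C frequencies; the function names and the `gc` parameter are illustrative, and real genomes depart from this model.

```python
def stop_codon_probability(gc=0.5):
    """Probability that a random codon is TAA, TAG or TGA under the simple model."""
    at = (1.0 - gc) / 2.0  # frequency of A, and of T
    g = gc / 2.0           # frequency of G (and of C)
    # TAA contributes at**3; TAG and TGA each contribute at * at * g.
    return at ** 3 + 2.0 * at * at * g


def prob_open_window(n_codons, gc=0.5):
    """Probability that n_codons consecutive random codons contain no stop codon."""
    return (1.0 - stop_codon_probability(gc)) ** n_codons


for gc in (0.3, 0.5, 0.7):
    print(f"GC = {gc:.0%}: P(100 stop-free codons) = {prob_open_window(100, gc):.4f}")
```

At 50% GC, three of the 64 codons are stops, so a given window of 100 codons stays open with a probability of a little under 1%. That is small for a single window but not negligible across the millions of windows in a genome, and because stop codons are AT-rich, the probability rises sharply in GC-rich sequence.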

PURIFYING SELECTION
A common form of evolutionary change in which a
mutation is harmful, and therefore disappears from the
population.

NEUTRAL DRIFT
The process by which DNA sequence acquires many
mutations over time that have no phenotypic effect, and hence
are not acted on by Darwinian selection.

POSITIVE SELECTION
A rare form of evolutionary change in which a mutation
seems to be favoured because it is fixed in the population at a
rate even greater than predicted by neutral drift.

NEURAL NETWORK
A popular machine learning method that is often used for
automatic classification of biological sequences, based on
'training' on a set of known examples.

SM PROTEIN
An RNA-binding protein recognized by antibodies that
are produced by people with certain autoimmune diseases;
'Sm' stands for 'Smith', the name of a patient.

Comparative genome analysis is an indispensable
means of inferring whether a locus produces a ncRNA
as opposed to encoding a protein. For a small gene to be
called a protein-coding gene, one excellent line of evidence
is that the ORF is significantly conserved in
another related species. For almost all protein-coding
genes (those undergoing purifying selection or neutral 
drift, but perhaps not those under positive selection), the 
pattern of mutation should also favour synonymous 
and conservative amino-acid changes. Comparative 
analysis has been instrumental in many cases of distin- 
guishing ncRNA genes from small protein-coding 
genes, including the examples above. It is more difficult 
to positively corroborate a ncRNA by comparative 
analysis but, in at least some cases, a ncRNA might con- 
serve an intramolecular secondary structure and com- 
parative analysis can show compensatory base substitu- 
tions 19,101 . With comparative genome sequence data now 
accumulating in the public domain for most if not all 
important genetic systems, comparative analysis can 
(and should) become routine. 
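
The coding side of this test can be sketched as a count of synonymous versus amino-acid-changing codon differences between two aligned sequences. The sketch assumes the standard genetic code, an ungapped DNA alignment and a known reading frame; the function name is illustrative, and a real analysis would also need a statistical model of what to expect by chance.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_changes(a, b, frame=0):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    a, b: aligned, ungapped DNA strings (an assumption of this sketch).
    frame: offset of the first complete codon.
    """
    synonymous = nonsynonymous = 0
    for i in range(frame, min(len(a), len(b)) - 2, 3):
        codon_a, codon_b = a[i:i + 3], b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous
```

A conserved ORF under purifying selection is expected to show an excess of synonymous differences; `codon_changes("ATGGCTAAA", "ATGGCCAAG")`, for instance, counts two differing codons, both synonymous.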

Discovering new non-coding RNA genes. There are now 
three main lines of attack for systematically identifying 
new ncRNA genes. First, computational comparative 
genome analysis seems to be a very powerful approach. 
All three ncRNA screens in E. coli exploited comparative
analysis 18-20 , as did one of the screens for new miRNAs
in C. elegans 15 . These approaches range in complexity 
from BLASTN screens that identify conserved regions 
that do not correspond to apparent ORFs, to identifying 
regions that conserve some particular type of RNA 
structure (such as the miRNA precursor stem), to a general
ncRNA gene-finding program looking for any significant
conserved intramolecular secondary structure 101 .
Previous attempts to develop ncRNA
gene-finders that work on a single genome sequence
have been stymied by the apparent lack of much significant
statistical signal in ncRNAs 115,116 , compared with
the strong ORF and codon bias signals exploited by pro- 
tein-coding gene-finders. However, an apparently suc- 
cessful single-sequence RNA gene-finder, using a neural 
network approach, has recently been reported 104 , and it 
might also be possible to identify untranslated, spliced 
ncRNAs by the computational identification of clus- 
tered splice-site signals 117 . 

Second, cDNA cloning strategies that are specifically 
designed to enrich for ncRNAs have been very fruitful. 
The most obvious enrichment strategy is simply to clone
and sequence small RNAs from total RNA (as opposed 
to the usual selection of large, cytoplasmic, polyadenylat- 
ed mRNA for cDNA cloning and EST sequencing) 14 . 
Enrichment by immunoprecipitation with antisera 
against proteins that associate with specific families of 
ncRNAs is another strategy that has been used for 
decades; examples include the isolation of snRNAs using 
anti-Sm autoantibodies 35 and isolation of C/D snoRNAs
using anti-fibrillarin sera 71 . Some ncRNAs can be
enriched by virtue of 5' ends that differ from the 'normal'
mRNA cap; intronic snoRNAs and miRNAs, for
example, have simple 5' phosphates that are substrates 
for RNA ligase 16,56 . Enrichment by exploiting the subcel- 
lular localization of ncRNAs can also be useful, as in the 
isolation of snoRNAs from cDNA libraries made from 
purified nucleoli 56 . There must be other clever enrich- 
ment schemes. Unenriched public EST and cDNA 
sequence libraries can also be mined for transcripts that 
lack significant ORFs, although at some danger of being 
confused by small ORFs, frameshift sequencing errors, 
or long UTRs of mRNAs. 
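
Mining a transcript collection for sequences that lack significant ORFs can be sketched as a six-frame scan for the longest ATG-initiated, stop-free run. The function names and the 100-codon cutoff are hypothetical, and the caveats above apply in full: a small real ORF, a frameshift sequencing error or a long UTR will each make a coding transcript look non-coding to a filter of this kind.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def longest_orf_codons(dna):
    """Length in codons of the longest ATG-initiated, stop-free run in six frames.

    Runs still open at the end of the sequence are counted, because cDNA and
    EST reads are often partial.
    """
    best = 0
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            run = 0  # codons since the current in-frame ATG; 0 means no ORF is open
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon in STOPS:
                    run = 0
                elif run > 0 or codon == "ATG":
                    run += 1
                    best = max(best, run)
    return best


def lacks_significant_orf(dna, min_codons=100):
    """Heuristic flag only; min_codons is a hypothetical cutoff, not a rule."""
    return longest_orf_codons(dna) < min_codons
```

Transcripts flagged this way are candidates to be followed up by comparative analysis, not confirmed ncRNAs.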

Third, it should be possible, in principle, to detect 
new transcripts (both ncRNA and protein-coding RNA) 
using high-density oligonucleotide microarrays that systematically
probe an entire genome, rather than just
probing expression of known and predicted protein- 
coding genes. However, experience with E. coli whole- 
genome chips has been variable. Successful detection of 
some known ncRNAs has been reported anecdotally 118 ; 
but in systematic use, such data have proved to be more
useful as corroboration than as a primary screen 20 . I
would expect these data to become more useful as 
microarray technology continues to improve. 

The modern RNA world 

The discovery of RNA catalysis 119,120 and the "RNA
world" hypothesis for the origin of life 26,121 provide a
seductive explanation for why rRNA and tRNA are at
the core of the translation machinery: perhaps they are
the frozen evolutionary relic of the invention of the
ribosome by an RNA-based 'riboorganism' 122 . Other
known ncRNAs have also been proposed to be ancient
relics of the last riboorganisms 123-125 . The romantic idea
of uncovering molecular fossils of a lost RNA world has
motivated searches for new ncRNAs. However, as these
searches start to succeed, more and more ncRNAs are
being found to have apparently well-adapted, specialized
biological roles. The idea that ncRNAs are a small
and ragged band of relics looks increasingly untenable.
The tiny stRNAs and miRNAs, for example, seem to be
highly adapted for a world in which RNAi processing
and developmentally regulated mRNA targets exist.

Therefore, consider an alternative idea: the "modern
RNA world". Many of the ncRNAs we see in fact
have roles in which RNA is a better-suited material
than protein. Non-coding RNAs are often (though not
always) found to have roles that involve sequence-specific
recognition of another nucleic acid. (The choice
of examples in FIGS 1,2 and 4 is deliberate, showing how
snoRNAs, miRNAs and E. coli riboregulatory RNAs all
function by sequence-specific base complementarity.)
RNA, by its very nature, is an ideal material for this role.
Base complementarity allows a very small RNA to be
exquisitely sequence specific. Evolution of a small, specific
complementary RNA can be achieved in a single
step, just by a partial duplication of a fragment of the
target gene into an appropriate context for expression of
the new ncRNA.
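
The specificity that base complementarity gives a short RNA can be made concrete with a small sketch. It assumes perfect Watson-Crick pairing only (no G-U pairs, mismatches or bulges) and a uniform random background sequence; the function names and example sequences are illustrative.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the sequence that pairs perfectly, antiparallel, with rna."""
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(guide, target):
    """Return start positions in target that are perfectly complementary to guide."""
    site = reverse_complement(guide)
    return [
        i for i in range(len(target) - len(site) + 1) if target.startswith(site, i)
    ]


def chance_match_probability(n):
    """Probability that a given n-nucleotide site matches by chance (uniform bases)."""
    return 0.25 ** n


print(find_target_sites("GGAUC", "AAGAUCCAA"))
print(f"{chance_match_probability(21):.1e}")
```

Because the chance of a spurious perfect match falls fourfold with every added nucleotide, a guide of about 20 nucleotides is already effectively unique within a transcriptome under this simple model.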

Many functional roles do not require the more 
sophisticated catalytic prowess of proteins and could be 
carried out by simple RNAs. Post-transcriptional regu- 
lation, in particular, can be achieved simply by steric 
occlusion of sites on a target pre-mRNA or mature
RNA. In cases requiring more sophistication than simple
steric blockage, necessary catalytic functions can be
delegated to a small number of shared proteins, whereas
specific sequence recognition functions are carried out
by a horde of individual small RNAs that interact with
these proteins. John Morrissey and David Tollervey have
proposed that modification-guide snoRNAs arose in
this way, as a more modular system that replaced a
smaller number of site-specific protein methylases and
pseudouridylases 126 .

The idea that ncRNA would be well adapted for regulatory
roles is not new 35,50,127 . In the process of defining
many of the concepts of molecular genetics, including
mRNA and operons, François Jacob and Jacques
Monod distinguished "structural genes" (such as lacZ)
from "regulatory genes" (such as lacI) 128 . At that time,
regulators such as lacI had only been defined genetically,
and they were known to specifically interact with cis-acting
sequences (such as lacO), either at the DNA or
mRNA level. Jacob and Monod reasoned that base complementarity
would allow RNA to interact highly specifically
with other nucleic-acid sequences. They proposed
that structural genes encoded proteins, and regulatory
genes produced ncRNAs (FIG. 5). Forty years later, their
proposal is looking more relevant than ever.